Abstract

A concurrent object is a data structure shared by concurrent processes. A 
wait-free implementation of a concurrent object guarantees that every operation 
completes in a finite number of steps, regardless of how processes interleave. It 
is known, however, that if concurrent processes communicate only by applying 
read and write operations to a shared memory, then it is impossible to con- 
struct wait-free implementations of many simple and useful data objects. In 
this paper we show how to construct randomized wait-free implementations of 
long-lived concurrent objects, implementations that guarantee that every oper- 
ation completes in a finite expected number of steps, even against a powerful 
adversary. 

This paper will appear in the Tenth Annual ACM Symposium on Principles 
of Distributed Computing, August 19-21, Montreal, Canada. 

1 Introduction

A concurrent object is a data structure shared by asynchronous concurrent pro- 
cesses. An implementation of a concurrent object is wait-free if it guarantees 
that every operation completes in a finite number of steps, regardless of how
process steps are interleaved. An implementation is randomized wait-free if it
guarantees that every operation completes in a finite expected number of steps. The
wait-free condition guarantees that no process can be prevented from completing 
an operation by variations in other processes' speeds, or by undetected halting 
failures. This condition rules out many conventional algorithmic techniques 
such as busy-waiting, conditional waiting, critical sections, or barrier synchro- 
nization, since the failure or delay of a single process can prevent the non-faulty 
processes from making progress. 

In this paper, we propose new algorithms for constructing randomized wait- 
free implementations of arbitrary objects in an architecture where processes 
communicate by applying read and write operations to locations in shared mem- 
ory. Randomization is necessary because this kind of architecture does not 
support wait-free solutions to many fundamental problems [3, 11, 13, 17, 22], 
including basic decision problems such as consensus [14], and implementations 
of many simple data objects such as sets, queues, or lists. We give a general 
algorithm for constructing a randomized implementation of any read-modify- 
write [19] operation. Our construction uses no unbounded registers, and has 
worst-case time and space complexity identical to that of the best randomized 
consensus protocols known for this model [4, 7, 27]. 

The principal contribution of this work is to show how techniques developed 
for short-lived decision problems can be adapted to long-lived data objects. 
Much of the work on randomized wait-free synchronization algorithms concerns 
decision problems, such as consensus, in which a protocol is run only once. Each 
process simply executes its part of the protocol and halts, and the shared data 
structures do not need to be reused. By contrast, practical applications such 
as operating systems and databases are organized around long-lived data
objects, which are inherently more difficult than decision problems. A data object
has an unbounded lifetime during which each process can execute an arbitrary 
sequence of operations. Unlike a decision protocol, an object implementation 
must ensure that the size of the object's representation remains bounded even 
when the number of operations applied to it increases without limit. It must 
retain enough information to ensure that "sleepy" processes that arbitrarily sus- 
pend and resume execution can continue to progress, while discarding enough 
information to keep the object size bounded. A wait-free object implementa- 
tion must also guard against starvation, since one operation can be "overtaken" 
by an arbitrary sequence of other operations, a problem that does not arise in 
decision protocols. 

2 Preliminaries

A concurrent system consists of a collection of n processes that communicate 
through shared typed objects. Processes are sequential — each process applies 
a sequence of operations to objects, alternately issuing an invocation and then 
receiving the associated response. We make no fairness assumptions about pro- 
cesses. A process can halt, or display arbitrary variations in speed. In particular, 
one process cannot tell whether another has halted or is just running very slowly. 

Objects are data structures in memory. Each object has a type, which defines 
a set of possible values and a set of primitive operations that provide the only 
means to manipulate that object. Each object has a sequential specification 
that defines how the object behaves when its operations are invoked one at a 
time by a single process. For example, the behavior of a queue object can be 
specified by requiring that enq insert an item in the queue, and that deq remove 
the oldest item in the queue. In a concurrent system, however, an object's 
operations can be invoked by concurrent processes, and it is necessary to give 
a meaning to interleaved operation executions. An object is linearizable [18] 
if each operation appears to take effect instantaneously at some point between 
the operation's invocation and response. Linearizability implies that processes 
appear to be interleaved at the granularity of complete operations, and that the 
order of non-overlapping operations is preserved. 

We focus on an asynchronous multiple instruction/multiple data (MIMD) 
architecture in which shared memory consists of a sequence of locations, called 
registers, that can be written by (at least) one process and read by any pro- 
cess. Our time and space complexity measures are expressed in terms of read 
and write operations on registers of size O(n). Polynomial-time algorithms
for implementing large single-writer/multi-reader registers from small ones are
well-known [10, 20, 21, 24, 25, 28]. 

A read-modify-write operation [19] is defined as follows. Let x be an object, 
v its current value, and f a function from values to values. The operation
RMW(x, f) atomically replaces the value of x with f(v), and returns v (Figure
1). Although read-modify-write operations implemented in hardware typically 
work on single words of memory, software implementations such as the one 
given here can work on objects of arbitrary size. The class of read-modify-write 
operations is universal, in the sense that one can implement any operation 
(such as enq or deq) from a suitable read-modify-write operation. A second 
approach to universality is to focus on a particular read-modify-write operation, 
such as compare&swap, shown in Figure 2. Elsewhere [15], we give a detailed 
methodology for transforming a stylized sequential implementation of any object 
into a wait-free linearizable implementation in which processes synchronize using 
only read, write, and compare&swap operations. 
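
As a rough illustration of the semantics in Figures 1 and 2, the following Python sketch models a register that supports read-modify-write, and derives compare&swap and two queue operations from it. The class name `Register` and the helpers `enq` and `deq` are hypothetical names chosen for this example. The lock only stands in for the atomicity of a single operation; it is an assumption of the sketch and is not a wait-free implementation.

```python
import threading

class Register:
    """Toy register; a lock stands in for atomicity (NOT wait-free)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def rmw(self, f):
        # Atomically replace the current value v with f(v) and return v.
        with self._lock:
            previous = self._value
            self._value = f(previous)
            return previous

    def compare_and_swap(self, old, new):
        # Compare&swap is the RMW with f(v) = new if v == old else v.
        return self.rmw(lambda v: new if v == old else v) == old


# Universality, illustrated: queue operations written as RMW functions
# over an immutable tuple that holds the whole queue.
def enq(q, item):
    q.rmw(lambda items: items + (item,))

def deq(q):
    before = q.rmw(lambda items: items[1:])
    return before[0] if before else None
```

Representing the whole queue as one immutable value keeps each operation a single function application, which is the sense in which one suitable read-modify-write operation suffices.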

Our construction also makes use of an atomic snapshot scan algorithm.
Given an n-element array A, where P is the only process that writes A[P],
the snapshot scan makes an instantaneous (linearizable) copy of A. Atomic
snapshot scan algorithms have been proposed by Anderson [2] (bounded
registers and exponential running time), Aspnes and Herlihy [6] (unbounded
registers and O(n^2) running time), and by Afek et al. (bounded registers and
O(n^2) running time). Of these, the last algorithm is the best suited for our
purposes, since it is both bounded and efficient.

RMW(r: register, f: function) returns(value)
    previous := r
    r := f(r)
    return previous
end RMW

Figure 1: A Read-Modify-Write Operation

compare&swap(w: word, old, new: value) returns(boolean)
    if w = old
        then w := new
             return true
        else return false
    end if
end compare&swap

Figure 2: The Compare&Swap Operation

3 Randomized Consensus 

The heart of our construction is a binary consensus protocol, in which each of
n asynchronous processes starts with a preference taken from a two-element set 
(typically {0, 1}), and runs until it chooses a decision value and halts. The proto- 
col is correct if it is consistent: no two processes choose different decision values, 
valid: the decision value is some process's preference, and randomized wait-free: 
each process decides after a finite expected number of steps. When computing 
a protocol's expected number of steps, we assume that scheduling decisions are 
made by an adversary with unlimited resources and complete knowledge of the 
processes' protocols, their internal states, and the state of the shared memory. 
The adversary cannot, however, predict future coin flips. 

Our technique can be applied to a variety of randomized binary consensus 
protocols, but to keep our discussion as concrete as possible, we focus on the 
simplest such protocol, due to Aspnes [4], shown in Figure 3. Here, n processes 
collectively undertake a one-dimensional random walk centered at the origin 
with absorbing barriers at ±2n. The protocol makes use of three counters: V_0
and V_1 are the respective numbers of processes that prefer 0 and 1, and C is the
cursor for the random walk. Each process alternates between reading the three
shared counters (using a single atomic scan) and updating C. Eventually the 
counter reaches one of the absorbing barriers, determining the decision value. 
While the counter is near the middle of the region, each process flips an unbiased 
coin to determine the direction in which to move the counter. If the counter is 
sufficiently close to one of the barriers, however, each process moves it determin- 
istically toward that barrier. Each shared counter (Figure 4) is implemented as 
an n-element array of integers, one per process. To increment or decrement the 
counter, a process updates its own field. To read the counter, it atomically scans 
all the fields. Careful use of modular arithmetic ensures that all values remain 
bounded. The expected running time of this protocol, expressed in primitive 
reads and writes, is O(n^4), and the space complexity is O(n^2).
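
The random walk described above can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the names `Counter`, `consensus`, and `run` are hypothetical, a list copy stands in for the atomic scan, and a seeded random scheduler stands in for the adversary (it is far weaker than the adversary assumed in the analysis). The drift cases follow the prose: the cursor moves toward the nearer barrier once it is far enough from the origin, or toward the only preference that has been announced.

```python
import random

class Counter:
    """One slot per process, values kept modulo m (in the style of Figure 4)."""

    def __init__(self, n, r):
        self.r = r
        self.m = 2 * r + 2              # any integer greater than 2r + 1
        self.slots = [0] * n

    def inc(self, p):
        self.slots[p] = (self.slots[p] + 1) % self.m

    def dec(self, p):
        self.slots[p] = (self.slots[p] - 1) % self.m

    def read(self):
        # A plain list copy stands in for the atomic snapshot scan.
        v = sum(list(self.slots)) % self.m
        return v if v <= self.r else v - self.m   # representative in [-r, r]


def consensus(pid, prefer, V, C, rng):
    """Generator for one process; each yield lets the scheduler switch."""
    n = len(C.slots)
    V[prefer].inc(pid)                  # announce the preference
    while True:
        yield
        p0, p1, c = V[0].read(), V[1].read(), C.read()   # one "scan"
        if c < -2 * n:
            return 0
        if c > 2 * n:
            return 1
        yield                           # the values read may now be stale
        if c < -(p0 + p1) or p1 == 0:
            C.dec(pid)                  # drift toward the barrier for 0
        elif c > (p0 + p1) or p0 == 0:
            C.inc(pid)                  # drift toward the barrier for 1
        elif rng.random() < 0.5:
            C.inc(pid)
        else:
            C.dec(pid)


def run(prefs, seed=0):
    rng = random.Random(seed)
    n = len(prefs)
    V = [Counter(n, n), Counter(n, n)]
    C = Counter(n, 3 * n)
    procs = {p: consensus(p, prefs[p], V, C, rng) for p in range(n)}
    decisions = {}
    while procs:
        p = rng.choice(sorted(procs))   # random scheduler, not an adversary
        try:
            next(procs[p])
        except StopIteration as done:
            decisions[p] = done.value
            del procs[p]
    return decisions
```

Calling `run([0, 1, 1, 0])` returns a dictionary mapping each process index to its decision; in every schedule the decisions agree, and they equal the common preference when all processes start with the same one.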

Any binary consensus protocol can be extended to a consensus protocol in
which processes take their preferences from an arbitrary set. Consider a binary
tree of depth log n where each leaf is initially associated with a process. In the
first round, each process performs binary consensus with its immediate neighbor,
and each process adopts the resulting decision value as its next preference. At
round r, two groups of 2^{r-1} processes, each with a common preference, achieve
binary consensus to decide the common preference for both groups at the next
level. The protocol terminates after log n rounds of binary consensus, when all
processes have a common preference. The simple inequalities

$$\sum_{i=1}^{\log n} 2^{4i} \le \Bigl(\sum_{i=1}^{\log n} 2^{i}\Bigr)^{4} \le (2n)^{4}$$

imply that the asymptotic expected running time for the multi-value consensus
protocol is identical to that of binary consensus:

$$O\Bigl(\sum_{i=1}^{\log n} 2^{4i}\Bigr) = O(n^{4})$$

The same property holds for any polynomial-time protocol. The space complexity
increases from O(n^2) to O(n^2 log n).
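
The inequalities can be checked numerically for small cases. The sketch below assumes n is a power of two and charges round i a cost of (2^i)^4, matching the summand above; the function name `tree_cost` is hypothetical.

```python
import math

def tree_cost(n):
    """Sum over rounds i = 1..log2(n) of (2**i)**4 (n a power of two)."""
    rounds = int(math.log2(n))
    return sum(2 ** (4 * i) for i in range(1, rounds + 1))

for n in (2, 4, 8, 16, 1024):
    rounds = int(math.log2(n))
    middle = sum(2 ** i for i in range(1, rounds + 1)) ** 4
    # Both inequalities from the text hold for every n tried here.
    assert tree_cost(n) <= middle <= (2 * n) ** 4
```

For n = 4, for instance, the two rounds cost 2^4 + 2^8 = 272, which is below (2 + 4)^4 = 1296 and (2n)^4 = 4096.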

4 Formal Model 

A more complete version of our formal model appears elsewhere [17]. Formally, 
we model objects and processes as non-deterministic automata, a simplified form 
of the I/O automata of Lynch and Tuttle [23]. Each automaton has a set of 
states, sets of input, output, and internal events, and a transition relation given 
by a set of triples (s', e, s), where s and s' are states and e is an event. Such a 
triple is called a step, and it means that an automaton in state s' can undergo
a transition to state s, and that transition is associated with the event e. If
(s', e, s) is a step, we say that e is enabled in s'. Inputs cannot be disabled: for
each input event e and each state s', there exist a state s and a step (s', e, s).

shared data:
    V_0: counter with range [0 ... n]
    V_1: counter with range [0 ... n]
    C: counter with range [-3n ... 3n]

consensus(prefer) returns(Boolean)
    inc(V_prefer)
    loop
        p_0, p_1, c := read(V_0, V_1, C)
        select
            case c < -2n do return 0
            case c > 2n do return 1
            case c < -(p_0 + p_1) or p_1 = 0 do dec(C)
            case c > (p_0 + p_1) or p_0 = 0 do inc(C)
            otherwise do
                if flip()
                    then inc(C)
                    else dec(C)
                end if
        end select
    end loop
end consensus

Figure 3: A Randomized Binary Consensus Protocol

An execution fragment of an automaton A is a finite sequence s_0, e_1, s_1, ..., e_n, s_n
or infinite sequence s_0, e_1, s_1, ... of alternating states and events such that each
(s_i, e_{i+1}, s_{i+1}) is a step of A. An execution is an execution fragment where s_0
is a starting state. A history fragment of an automaton is the subsequence of 
events occurring in an execution fragment, and a history is the subsequence 
occurring in an execution. 
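
These definitions translate directly into a few lines of Python. The sketch below is illustrative only: it represents the transition relation as a set of triples (previous state, event, next state), and the function names and the one-bit register automaton are assumptions made for the example.

```python
def is_execution_fragment(steps, seq):
    """seq alternates states and events: [s0, e1, s1, ..., en, sn].
    steps is a set of triples (previous_state, event, next_state)."""
    if len(seq) % 2 == 0:
        return False                    # must start and end with a state
    return all((seq[i], seq[i + 1], seq[i + 2]) in steps
               for i in range(0, len(seq) - 2, 2))

def history(seq):
    # A history fragment keeps only the events of the execution fragment.
    return seq[1::2]

# A one-bit register automaton: w0/w1 write a bit, r0/r1 read it back.
steps = {(0, "w1", 1), (1, "w0", 0), (0, "r0", 0), (1, "r1", 1)}
fragment = [0, "w1", 1, "r1", 1, "w0", 0]
```

Here `is_execution_fragment(steps, fragment)` holds, and `history(fragment)` yields the event sequence w1, r1, w0.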

Automata can be composed if they share no output or internal events. A 
state of the composed automaton S is a tuple of component states, and a starting 
state is a tuple of component starting states. The set of events of S is the union 
of the components' sets of events. The set of output events of S is the union 
of the components' sets of output events; the set of internal events is the union 
of the components' sets of internal events; and the set of input events of S is
the set of input events of the components that are not output events of any
component. A triple (s', e, s) is a step of S if and only if, for all component
automata A, one of the following holds: (1) e is an event of A, and the projection
of the step onto A is a step of A, or (2) e is not an event of A, and A's state
components are identical in s' and s. If H is a history of a composite automaton
and A a component, then H|A denotes the subhistory of H consisting of events of A.

[-r, r] is the range of the counter
m is any integer greater than 2r + 1

inc(C)
    v := C[P]
    write(v + 1 mod m, C[P])
end inc

dec(C)
    v := C[P]
    write(v - 1 mod m, C[P])
end dec

read(C) returns(integer)
    c := scan(C)
    v := sum of c[i] for i = 0, ..., n - 1
    return v' where -r ≤ v' ≤ r
        and v' = v (mod m)
end read

Figure 4: Counter Implementation

Processes and objects are each modeled as automata. Operation invocations
are modeled as output events of processes, and input events of objects,
while operation results are input events of processes and output events of
objects. To capture the notion that a process represents a single thread of
control, we say that a process history is well-formed if it begins with an
invocation and alternates matching invocations and responses. A concurrent system
{P_1, ..., P_n; A_1, ..., A_m} is an automaton composed from processes P_1, ..., P_n
and objects A_1, ..., A_m, where processes and objects are composed by identifying
corresponding invocations and responses. A history H of a concurrent system is
well-formed if each H|P_i is well-formed, and a concurrent system is well-formed
if each of its histories is well-formed. Henceforth, we restrict our attention to
well-formed concurrent systems.

An execution is sequential if its first event is an invocation, and it alternates 
matching invocations and responses. A history is sequential if it is derived from 
a sequential execution. (Notice that a sequential execution permits process 
steps to be interleaved, but at the granularity of complete operations.) If we 
restrict our attention to sequential histories, then the behavior of an object can
be specified in a particularly simple way: by giving pre- and postconditions for 
each operation. We refer to such a specification as a sequential specification. 

If H is a history, let complete(H) denote the maximal subsequence of H
consisting only of invocations and matching responses. Each history H induces
a partial precedence order ≺_H on its operations: p ≺_H q if the response
for p precedes the invocation for q. Operations unrelated by ≺_H are said to
be concurrent. If H is sequential, ≺_H is a total order. A concurrent system
{P_1, ..., P_n; A_1, ..., A_m} is linearizable if each history H can be extended,
by adding zero or more responses, to a well-formed history H' for which
there exists a sequential history S such that:

• For all P_i, complete(H')|P_i = S|P_i

• ≺_H ⊆ ≺_S

In other words, the history "appears" sequential to each individual process, and 
this apparent sequential interleaving respects the real-time precedence ordering 
of operations. Equivalently, each operation appears to take effect instanta- 
neously at some point between its invocation and its response. A concurrent 
object A_j is linearizable [18] if, for every history H of every concurrent system
{P_1, ..., P_n; A_1, ..., A_j, ..., A_m}, H|A_j is linearizable. A linearizable object is
thus "equivalent" to a sequential object, and its operations can also be specified
by simple pre- and postconditions. We restrict our attention to linearizable
concurrent systems.
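
For very small histories, linearizability can be checked by brute force: try every ordering of the operations, discard orderings that contradict the real-time precedence order, and replay the rest against the sequential specification. The sketch below does this for a single read/write register. It assumes every operation has completed (no pending invocations), and the tuple layout and function name are hypothetical.

```python
from itertools import permutations

def linearizable(ops, initial=0):
    """ops: list of (inv_time, res_time, kind, value) for one register,
    where kind is "write" (value written) or "read" (value returned).
    Assumes every operation has both an invocation and a response."""
    for order in permutations(range(len(ops))):
        pos = {op: i for i, op in enumerate(order)}
        # Real-time order: if a responded before b was invoked,
        # a must come before b in the sequential history.
        if any(ops[a][1] < ops[b][0] and pos[a] > pos[b]
               for a in pos for b in pos):
            continue
        # Replay the candidate order against the sequential specification.
        state, legal = initial, True
        for i in order:
            _, _, kind, value = ops[i]
            if kind == "write":
                state = value
            elif state != value:
                legal = False
                break
        if legal:
            return True
    return False
```

A read that overlaps a write may return either the old or the new value, whereas a read invoked after the write has responded must return the new value. The search is exponential in the number of operations, so this is a specification aid rather than a practical checker.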

An implementation of an object A is a concurrent system {F_1, ..., F_n; R},
where the F_i are called front-ends, and R is called the representation. Informally,
R is the data structure that implements A, and F_i is the procedure called by
process P_i to execute an operation. Each invocation of an operation of A is an
input event of F_i, and each response is an output event of F_i. An implementation
I_j of A_j is correct if the two systems are indistinguishable to the ensemble of
processes: for every history H of {P_1, ..., P_n; A_1, ..., I_j, ..., A_m}, there exists
a history H' of {P_1, ..., P_n; A_1, ..., A_j, ..., A_m}, such that H|{P_1, ..., P_n} =
H'|{P_1, ..., P_n}.

An implementation is wait-free [17] if: 

• It has no history in which an invocation of P_i remains pending across an
infinite number of steps of F_i.

• If P_i has a pending invocation in a state s, then there exists a history
fragment starting from s, consisting entirely of events of F_i and R, that
includes the response to that invocation.

The first condition rules out unbounded busy-waiting: a front-end cannot
take an infinite number of steps without responding to an invocation. The
second condition rules out conditional waiting: F_i cannot block waiting for
another process to make a condition true.

In this paper, the representation object R is an array of registers that provide 
linearizable read and write operations. We also use a scan procedure, which has a 
wait-free implementation using read and write, as well as a consensus procedure, 
which has a randomized wait-free implementation. For brevity, our algorithms 
are expressed in pseudocode, although it is straightforward to translate this 
notation into automata definitions. 

5 The Algorithm 

The processes execute a sequence of consensus protocols, called rounds, which 
determine the order in which concurrent operations are applied. We say a 
process wins round r if its preference is that round's decision value, otherwise it 
loses. A sleepy process is one that suspends execution in the middle of a round, 
and resumes after a later round has started. 

Recall that each shared counter C used by the consensus protocol is implemented
as an array of (bounded) integer values, where each integer represents
one process's contribution to the counter's value. A simple technique that
permits different rounds to share the same data structures, at the cost of
unbounded registers, is to tag each counter register with the current round
number. When a process reads a register, it ignores values tagged with earlier
round numbers, treating those registers
as uninitialized. When a sleepy process reads a value tagged with a later round 
number, it quits the protocol, since information it needs may have been over- 
written. The sleepy process then inspects other data structures to reconstruct 
whether it had won the interrupted round. 
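
The unbounded tagging scheme can be sketched as follows. The representation of a counter as a list of (round number, contribution) pairs, and the names `Overtaken` and `read_tagged`, are assumptions made for this illustration.

```python
class Overtaken(Exception):
    """Raised when a reader sees a tag from a later round (hypothetical name)."""

def read_tagged(slots, my_round):
    """slots: list of (round_number, contribution) pairs, one per process."""
    total = 0
    for rnd, value in slots:
        if rnd > my_round:
            # Sleepy reader: needed information may have been overwritten.
            raise Overtaken(rnd)
        if rnd == my_round:
            total += value              # contribution to the current round
        # rnd < my_round: stale register, treated as uninitialized (0)
    return total
```

A reader in round 3 that sees slots tagged 3, 2, and 3 sums only the first and last contributions; a reader that sees any slot tagged 4 abandons the protocol.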

To avoid unbounded round numbers, we introduce the notion of a (bounded) 
leadership graph, a concept adapted from the distance graph construction of 
Attiya, Dolev, and Shavit [7]. A leadership graph is a graph whose vertices 
are connected by directed or undirected edges. A path in a leadership graph
from vertex v_0 to vertex v_k is a sequence of vertices v_0, ..., v_k such that for
each v_i, v_{i+1}, either there is an edge directed from v_i to v_{i+1}, or there is an
undirected edge between them. A directed path is a path that includes at least
one directed edge. We say that w is ahead of v (w > v) if there is a directed
path from v to w, and that v and w are even (v ~ w) if neither is ahead of the
other. Informally, if w is ahead of v, then w's consensus round is later than v's.
A leader is a vertex with no directed paths to other vertices. A k-generation
leadership graph is one where there are k vertices associated with each process. 
These k vertices will be connected by directed edges, and the last of these is the 
process's latest vertex. 

A vertex in a leadership graph is represented by a round vector, which is 
an n-element array of round counters. A round counter assumes values in the 
range {0, ..., 2k}. Value a is considered "greater" than a - 1, ..., a - k, and
"less" than a + 1, ..., a + k, where all arithmetic is modulo 2k + 1. If v and v'
are vectors associated with P and Q, then there is an edge directed from v to
v' if v[Q] < v'[P], and there is an undirected edge if v[Q] = v'[P]. Our protocol
uses a 2-generation leadership graph.
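
The cyclic ordering on round counters and the resulting edge rule can be sketched in a few lines. The function names are hypothetical, round vectors are modeled as lists indexed by process number, and the sketch assumes that a directed edge from v to v' corresponds to v[Q] being "less" than v'[P], with the reverse direction defined symmetrically.

```python
def compare(a, b, k):
    """Return 1, 0, or -1 as round counter a is "greater" than, equal to,
    or "less" than b, with all arithmetic modulo 2k + 1."""
    d = (a - b) % (2 * k + 1)
    if d == 0:
        return 0
    return 1 if d <= k else -1          # a in b+1..b+k means a is greater

def edge(v, v_prime, P, Q, k=2):
    """Edge between round vector v (owned by P) and v_prime (owned by Q)."""
    c = compare(v[Q], v_prime[P], k)
    if c < 0:
        return "v -> v'"                # v[Q] is "less" than v'[P]
    if c == 0:
        return "undirected"
    return "v' -> v"
```

With k = 2 the counters range over {0, ..., 4}, and the ordering wraps around: 0 counts as "greater" than 4 and 3, and as "less" than 1 and 2.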

New round vectors are created by the advance operation, shown in Figure 
5. This procedure takes as arguments a leadership graph Q and a process P. 
It generates a new round vector for P (without modifying the graph itself). 
It erases any outgoing edges from P's most recent vector, adds incoming edges 
from any previously unrelated vectors, and leaves existing incoming edges alone. 

A recycling counter implementation appears in Figure 5. Each element of the 
C array is now an entry consisting of an integer value tagged with one or more 
round vectors. A process P reads the counter by scanning C, and reconstructing 
a leadership graph from the round vectors. If P's latest round vector is not a 
leader, then P is sleepy, and it exits the operation by signaling an exception. 
Otherwise, it sums and returns the contributions of the other processes whose 
most recent round vectors are leaders. 

The recycling consensus protocol is constructed by replacing the counter in 
Figure 3 with a recycling counter. This protocol allows a single data structure 
to be reused for multiple consensus protocols. A sleepy process that attempts 
to read the counters will receive an exception. In such a case, it is convenient 
to define the protocol to return the distinguished value ⊥.

When a sleepy process discovers that it is no longer a leader, it must de- 
termine whether it won its last round. We incorporate a bounded amount of 
history information by associating a boolean toggle value with each operation. 
Each time a process starts a new operation, it complements its toggle bit. Each 
preference has an additional field: an n-element boolean horizon array. The 
toggle bit of the last round won by P is kept in p.horizon[P]. When a sleepy 
process P discovers it has been overtaken, it locates the winning preference p 
from any later round (as described below). P won the interrupted round if and 
only if its own toggle bit matches p.horizon[P]. 

A process starting a round must be able to reconstruct the object's current 
state. Our protocol maintains a 2-generation leadership graph Q in which each 
process P keeps a past vector from the last round it completed, and a present 
vector from the last round it started. Each past vector is tagged with either
the winning preference from that round or with the distinguished value ⊥. A
leading past vector is one that has no directed paths leading to any other past
vectors. The latest procedure scans Q, locates the leading past vectors, and
returns any associated preference distinct from ⊥.

The final issue concerns starvation. A naive approach is simply to use each 
consensus round to decide which operation is scheduled next. Such an algorithm 
is non-blocking, in the sense that the system as a whole continually makes 
progress, but it is not wait-free, since an individual process may starve if it loses 
every round. Instead, we use a form of software combining to guarantee that 
each operation completes in at most two rounds. Each process announces its 
intention to execute an operation, and each process collects recently announced
operations, and constructs its preference by applying them in sequence. 

We are now ready to present the complete protocol. We use the following 
data types and structures. An invocation is a record with two fields: a toggle bit 
and a function. A preference is a record with three fields: p.value is the object's
new value if the process wins the consensus round, p.response is an n-element
array of values, and p.horizon is an n-element array of toggle bits.

An entry is a record with the following fields: 

• past is the round vector from the last round P completed. 

• winner is the winning preference from that round, or ⊥.

• present is the round vector from the current round. This is the vector 
used by the consensus protocol to detect sleepy processes (as in Figure 5). 

• counters is a 3 log n-element array of integers used for the counters needed
by the recycling consensus protocol's random walks. 

The processes share an n-element array A of invocations, and an n-element array 
of entries Q. The entries' past and present fields define a 2-generation leadership 
graph. A leading past vector is one that has no directed paths leading to any 
other past vectors. We use two auxiliary procedures: scan takes an atomic 
snapshot scan of an arbitrary array, and latest scans Q, locates the leading past 
vectors, and returns an associated winner preference distinct from ⊥.

The pseudo-code for RMW(f), where f is a function, appears in Figure 5.
P first creates a new invocation, complements the previous invocation's toggle
bit, and stores the new invocation in A. P reads its present vector from Q, and
then calls latest to get the winning preference from the most recently completed
round. It then calls make-prefer (Figure 5) to create a new preference. This
procedure copies the last winning preference, and scans the A array. It then
locates unapplied invocations by comparing the invocations' toggle bits with
the corresponding bits in the preference's horizon array. It applies any new
invocations, storing their responses in the new preference's response field, and
their toggle bits in the horizon field. After creating the new preference, P then
joins the recycling consensus protocol, returning either the winning preference
or ⊥. P then creates a new entry for its next operation, setting past to the
current value of present, winner to the outcome of the protocol (if known) or ⊥,
present to the result of advance, and counters to an array of 3 log n zeroes. After
executing this loop twice, P calls latest to locate the latest winning preference,
and extracts its response to its operation from the preference's response field.
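
The combining step performed by make-prefer can be sketched as below. This is a sequential illustration under stated assumptions, not the figure's pseudocode: the record layouts follow the prose description, a Python list stands in for the scanned copy of A, and applying invocations in process-index order is a choice made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Invocation:
    toggle: bool
    f: Callable[[Any], Any]

@dataclass
class Preference:
    value: Any
    response: List[Any]
    horizon: List[bool]

def make_prefer(last: Preference, announced: List[Invocation]) -> Preference:
    """Build a new preference from the last winner and a scan of A."""
    new = Preference(last.value, list(last.response), list(last.horizon))
    for i, inv in enumerate(announced):
        if inv.toggle != new.horizon[i]:    # invocation not yet applied
            new.response[i] = new.value     # RMW returns the old value
            new.value = inv.f(new.value)
            new.horizon[i] = inv.toggle
    return new

def applied(pref: Preference, P: int, my_toggle: bool) -> bool:
    # P's current operation is reflected in pref iff the toggle bits match.
    return pref.horizon[P] == my_toggle
```

Because the horizon records which toggle bits have been consumed, running `make_prefer` again on its own output with the same announcements changes nothing, so an invocation is applied at most once however many processes pick it up.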

6 Correctness 

We give an explicit linearization: for any execution of the protocol, we construct 
an equivalent legal serial execution. Because of space limitations, some proofs 
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are omitted or sketched. 

Lemma 1 Being even is an equivalence relation: if u ~ v and v ~ w, then 
u ~ w. 

Proof: Every two vertices are joined by an edge, either directed or undirected, 
hence if u ~ v, then there is an undirected edge between them. Suppose u ~ v 
and v ~ w, but u < w. Then there is a directed path from v to w constructed 
by joining the undirected edge from v to u to the directed path from u to w,
contradicting the hypothesis that v ~ w. ∎

Since reasoning about round numbers is easier than reasoning about round 
vectors, we introduce unbounded round numbers as auxiliary data, variables 
which do not affect the protocol's control flow. We tag every round vector 
v in Q with an unbounded round number v̄ defined as follows. Initially, all
round vectors have round number zero. Suppose when P calls advance, g is
the scanned copy of Q, r is the highest round number in g, and r_P is P's
highest round number in g. If v is the round vector returned by advance, then
v̄ = max(r_P + 1, r).
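
As a small worked example of this rule, the sketch below computes the auxiliary round number from a scanned copy of the graph. The dictionary layout (process name mapped to the round numbers of its vectors) and the function name are assumptions for illustration only.

```python
def next_round(scanned, P):
    """scanned: dict mapping each process to the round numbers of its vectors."""
    r = max(max(nums) for nums in scanned.values())   # highest in the graph
    r_P = max(scanned[P])                              # highest owned by P
    return max(r_P + 1, r)
```

A process that has fallen behind jumps to the highest round number already present, while a process that already holds the highest round number opens the next one, so the maximum grows by at most one per call.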

Lemma 2 If advance returns a vector with round number r, then for every
integer between 0 and r, there exists a vector with that round number. 

Proof: Initially all vectors have round number 0, and each call to advance 
increases the maximum round number by at most one. ∎

If P sets its past vector to v, then we say that P writes v to past(Q), and
similarly for present(Q).

Lemma 3 If P is the first process to write r to present(Q), then it
simultaneously writes r - 1 to past(Q).

Proof: We show that if P's call to advance returns round number r, then P's 
present vector must have round number r — 1. By Lemma 2, if advance returns 
r, then round vectors exist with round number r - 1. If P's present vector has a
round number less than r - 1, then advance would return r - 1, since no higher
round number appears in the graph. ∎

The next lemma follows directly from the definition of round numbers. 

Lemma 4 If v ~ w, then v̄ = w̄.

Let present_P and past_P denote P's present and past vectors. The next two
lemmas follow from simple inductive arguments:

Lemma 5 |present_P[Q] - present_Q[P]| ≤ 1

Lemma 6 present_P[Q] - past_P[Q] ≤ 1

Lemma 7 If v̄ < w̄, then v < w.

Proof: Suppose the property is violated by vectors v and w, generated by P
and Q, such that v̄ < w̄ but v > w. We first claim that one can choose v and w
so that there is a directed edge from w to v (i.e., v[Q] > w[P]). Choose v and w
with a minimal-length directed path between them. If that path has the form
w, w', ..., v, then either w < w', in which case w and w' are closer, or w > w',
in which case w' and v are closer.

If v̄ < w̄, then P's scan occurred before Q's. We will show that if P's scan
occurs first, then there can be no directed edge from w to v (i.e., w[P] ≥ v[Q]).

Suppose that v was present in Q when Q performed its scan. If v is a
present vector, then Lemma 5 implies that w[P] = v[Q] or w[P] = v[Q] + 1. If
v is a past vector, and Q's present vector is v', then, as before, w[P] = v'[Q] or
w[P] = v'[Q] + 1, and Lemma 6 implies that either w[P] = v[Q], w[P] = v[Q] + 1,
or w[P] = v[Q] + 2. In both cases, w[P] ≥ v[Q].

Suppose that v was not yet written to Q when Q performed its scan. Let
v' be P's present vector during Q's scan, and let w' be Q's present vector
during P's scan. (Note that P's scan is earlier, and that Q can do an arbitrary
number of scans between P's scan and the scan in which it constructed w.
During Q's scans, however, P's vectors remain fixed.) There are three cases to
consider. If v'[Q] = w'[P], then v[Q] = w'[P] + 1 and w[P] = v'[Q] + 1, hence
w[P] = v[Q]. If v'[Q] > w'[P], then v[Q] = v'[Q] and either w[P] = v'[Q] or
w[P] = v'[Q] + 1, hence w[P] = v[Q] or w[P] = v[Q] + 1. Finally, if v'[Q] < w'[P],
then v[Q] = w'[P] and w[P] = w'[P], hence v[Q] = w[P]. In all three cases,
w[P] ≥ v[Q]. ∎

A round is complete if it has been written to past(G).

Lemmas 4 and 7 imply that we can view the consensus protocols as taking
place in disjoint rounds, where the decision value for round r - 1 is chosen
before that of round r. A process joins consensus round r if it joins the
consensus protocol while its present vector has round number r. P completes the
protocol successfully if it returns a preference distinct from ⊥, and
unsuccessfully otherwise.

Lemma 8 For each completed round r, some process joins consensus round r 
and completes successfully. 

Proof: Each process alternates executing the consensus protocol and writing a
new vector to present(G). Consider the first process to write r + 1 to present(G).
By Lemma 3, its previous present vector had round number r, hence it joined
consensus round r. It must have completed successfully, since it could not have
encountered any vectors with higher round numbers. ∎

We now show that the latest procedure can always find a value to return:

Lemma 9 Some leading past vector is always associated with a preference.

Proof: The first process to write r + 1 to present(G) is also the first to write r
to past(G), and its winner field is not equal to ⊥ (Lemma 8). ∎

Lemma 10 If P successfully completes consensus round r, then the value
returned by its call to latest is the decision value for round r - 1.

Proof: Lemma 7 implies that the leading past vector is associated with the
most recent round in past(G). Lemma 9 implies that this round is associated
with a decision value distinct from ⊥, hence latest returns a value. Lemma 3
implies that round r - 1 has been written to past(G), so the decision value
returned belongs to a round greater than or equal to r - 1. If the latest round in
past(G) is greater than r - 1, then the latest round in present(G) is greater than
r, and P could not have completed successfully. Therefore, the value returned
by latest must be the decision value for round r - 1. ∎

An invocation p is announced when it is written to A; it is observed by P
when P applies it to a value as part of constructing a preference; and it is
executed if P wins that round.

Lemma 11 Each invocation is executed at most once. 

Proof: After the invocation is executed, and before a new invocation is
announced, the toggle bit in the invocation agrees with the corresponding toggle
bit in future winning preferences' horizon fields, so make-prefer will skip that
invocation. ∎

Lemma 12 If an invocation is announced during round r, then it will be
executed either in round r or r + 1.

Proof: Suppose p is announced during round r but not executed in that round.
Let P be the process that wins round r + 1. Since p's announcement precedes
P's call to make-prefer, the invocation's toggle bit will disagree with the
corresponding horizon bit in the preference returned by P's call to latest, and
P will observe p. ∎

Theorem 13 The protocol in Figure 7 implements a randomized wait-free
linearizable read-modify-write operation.

Proof: Executions of the consensus protocols occur as a sequence of rounds
(Lemma 7), where each round has a unique winner (Lemma 8). Define the
object's state at round r to be the value field of the winning preference for that
round. The object's state at round r is constructed by applying a sequence
of invocations to the object's state at round r - 1 (Lemma 10). The resulting
execution is equivalent to a sequential execution in which operations are ordered
by the round number in which they were executed, and operations in the same
round are ordered by the order in which they were observed by that round's
winner. Every invocation is executed after at most two rounds (Lemma 12), and
no invocation is executed more than once (Lemma 11), so every invocation is
executed exactly once. If one operation starts after another returns, then the
later operation will have a higher round number, and therefore the sequential
order is a valid linearization order.

Finally, the protocol is randomized wait-free, since each execution of the
consensus protocol is randomized wait-free, and the encompassing protocol
terminates after a fixed number of steps. ∎
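
The linearization order used in this proof can be illustrated with a short Python sketch. The tuple layout (round, position, name) and the function name linearize are assumptions made for illustration only; they do not appear in the protocol.

```python
def linearize(executed):
    """Order executed operations as in the proof of Theorem 13.

    `executed` is a list of (round, position, name) tuples, where `round`
    is the consensus round in which the operation was executed and
    `position` is the index at which that round's winner observed it.
    Both fields are hypothetical bookkeeping for this sketch.
    """
    ordered = sorted(executed, key=lambda op: (op[0], op[1]))
    return [name for _, _, name in ordered]


if __name__ == "__main__":
    ops = [(2, 0, "c"), (1, 1, "b"), (1, 0, "a")]
    print(linearize(ops))
```

Operations from an earlier round always precede operations from a later round, which is why an operation that starts after another has returned is placed later in the order.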

7 Discussion and Related Work 

We remark that the universal protocol given here can be optimized when
something is known about the function f. For example, Figure 9 shows an
optimized implementation of compare&swap.¹ This implementation returns
immediately if the object's state fails to match the old argument, and it
eliminates the response field by having each winner return true and each loser
return false. A read operation is even simpler: it just returns latest(G).value.

Fischer, Lynch, and Paterson [14] show that there is no consensus protocol 
for two processes that communicate by asynchronous messages. Dolev, Dwork, 
and Stockmeyer [12] and Dwork, Lynch, and Stockmeyer [13] give a compre- 
hensive analysis of the circumstances under which consensus can be achieved 
by message-passing. Ben-Or [9] proposes a randomized consensus protocol with 
exponential expected running time that tolerates up to n/5 failures, where n is 
the number of processes. Loui and Abu-Amara [22] give several consensus pro- 
tocols and impossibility results for processes that communicate through shared 
registers with various read-modify-write ("test-and-set") operations. Chor,
Israeli, and Li [11] give two randomized consensus protocols for shared read/write
registers, one for two processes, and one for three processes. These protocols, 
however, run against a weaker adversary than the others cited here. 

The first shared-memory protocol that runs against a strong adversary is 
due to Abrahamson [1]. This protocol has exponential expected running time. 
The first polynomial-time protocol is an unbounded protocol due to Aspnes and 
Herlihy [5]. This protocol introduced the use of random walk, the basic tech- 
nique behind all known polynomial-time protocols. Attiya, Dolev, and Shavit 
[7] show how to eliminate unbounded counters from the original random walk 
algorithm, and Saks, Shavit, and Woll [27] show how to make the protocol fast 
when processes run in approximate synchrony. Bar-Noy and Dolev [8] adapt the 
random walk protocol to a message-passing model, yielding the fastest known 
consensus protocol for that model. 

¹ This procedure implements a slightly restricted compare&swap in which the old and new
arguments must be distinct.

The author [17] has shown that n-process consensus is universal in a
system of n processes: given a synchronization primitive that solves n-process
consensus, one can construct a deterministic wait-free implementation of any
object. Plotkin [26] also gives a universal construction using a particular
read-modify-write primitive called a sticky-bit. The author [15] gives a more
time- and space-efficient universal construction using read, write and the
well-known compare&swap instruction. If the shared memory provides only read
and write operations, then Herlihy and Aspnes [6] have shown that one can
construct a deterministic wait-free implementation of any object whose operations
either commute with or overwrite one another. The author [16] gives a more
general characterization of the objects that do have deterministic wait-free or
non-blocking implementations in this model.

[-r, r] is the range of the counter
m is any integer greater than 2r + 1

inc(C)
    v := C[P]
    v.value := v.value + 1 (mod m)
    C[P] := v
end inc

dec(C)
    v := C[P]
    v.value := v.value - 1 (mod m)
    C[P] := v
end dec

read(C) returns(integer) signals(quit)
    c := scan(C)
    g := leadership graph from c
    p := P's most recent vector in g
    if p ∉ leader(g)
        then signal quit
    end if
    v := 0
    for Q in 0 .. n - 1 do
        q := Q's most recent vector in g
        if q ∈ leader(g)
            then v := v + c[Q].value
        end if
    end for
    return v' where -r ≤ v' ≤ r
        and v' = v (mod m)
end read

Figure 5: Recycling Counter Implementation
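
The modular arithmetic in Figure 5 can be illustrated with a minimal Python sketch. It is a sequential simplification under stated assumptions: the leadership graph is omitted (every process is treated as a leader), scan is replaced by a plain read of a list, and the names inc, dec, and read_counter are chosen for this sketch only.

```python
def inc(counter, p, m):
    """Process p increments its own component modulo m."""
    counter[p] = (counter[p] + 1) % m


def dec(counter, p, m):
    """Process p decrements its own component modulo m."""
    counter[p] = (counter[p] - 1) % m


def read_counter(counter, r, m):
    """Return the representative v' in [-r, r] with v' = sum (mod m).

    Assumes m > 2r + 1, so each value in [-r, r] has a distinct residue.
    Leadership filtering from Figure 5 is deliberately left out.
    """
    total = sum(counter) % m
    if total > r:
        total -= m  # residues m-r .. m-1 stand for -r .. -1
    if not -r <= total <= r:
        raise ValueError("counter value is outside the range [-r, r]")
    return total


if __name__ == "__main__":
    r, m = 3, 8
    c = [0, 0, 0]
    inc(c, 0, m)
    inc(c, 0, m)
    for _ in range(3):
        dec(c, 1, m)
    print(read_counter(c, r, m))
```

Because each process writes only its own component, the true counter value is recovered from the sum of the components, and the choice m > 2r + 1 keeps the mapping from residues back to [-r, r] unambiguous.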


advance(G, P) returns(round-vector)
    r := new round-vector
    g := scan(G)
    p := P's most recent vector in g
    for Q in 0 .. n - 1 do
        q := Q's most recent vector in g
        select
            case p[Q] < q[P] do r[Q] := q[P]
            case p[Q] = q[P] do r[Q] := p[Q] + 1
            case p[Q] > q[P] do r[Q] := p[Q]
        end select
    end for
    return r
end advance

Figure 6: The Advance Procedure
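
As a rough check of the advance rule, the following Python sketch applies it under a simplified schedule in which each call scans and writes atomically. That atomicity is an assumption of the sketch, not a property of the protocol, and the helper simulate and its return values are hypothetical. Under this schedule the bounds stated in Lemmas 5 and 6 can be observed directly.

```python
import random


def advance(present, p):
    """Compute process p's next round vector from all present vectors.

    present[i][j] is entry j of process i's present vector.
    """
    n = len(present)
    mine = present[p]
    new = [0] * n
    for q in range(n):
        theirs = present[q]
        if mine[q] < theirs[p]:
            new[q] = theirs[p]      # catch up with q
        elif mine[q] == theirs[p]:
            new[q] = mine[q] + 1    # move one step ahead of q
        else:
            new[q] = mine[q]        # already ahead, stay put
    return new


def simulate(n, steps, seed=0):
    """Run a random atomic schedule; return (max_skew, max_jump).

    max_skew is the largest |present[a][b] - present[b][a]| seen,
    max_jump the largest per-entry increase in a single advance.
    """
    rng = random.Random(seed)
    present = [[0] * n for _ in range(n)]
    max_skew = 0
    max_jump = 0
    for _ in range(steps):
        p = rng.randrange(n)
        new = advance(present, p)
        max_jump = max(max_jump, max(new[q] - present[p][q] for q in range(n)))
        present[p] = new
        max_skew = max(
            max_skew,
            max(abs(present[a][b] - present[b][a])
                for a in range(n) for b in range(n)),
        )
    return max_skew, max_jump


if __name__ == "__main__":
    print(simulate(4, 500))
```

The sketch only exercises atomic interleavings; the inductive arguments in the text cover the general case where scans and writes are separate steps.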



RMW(f: function) returns(value)
    toggle := ¬ A[P].toggle
    A[P] := [toggle: toggle, function: f]
    for i in 1..2 do
        last := latest(G)
        prefer := make-prefer(last.winner)
        decision := consensus(prefer)
        G[P] := [past: G[P].present,
                 winner: decision,
                 present: advance(G, P),
                 counters: [0, ..., 0]]
    end for
    return latest(G).winner.response[P]
end RMW

Figure 7: Read-Modify-Write


make-prefer(old: preference) returns(preference)
    new := copy(old)
    a := scan(A)
    for Q in 0 .. n - 1 do
        if a[Q].toggle ≠ old.horizon[Q]
            then new.horizon[Q] := ¬ new.horizon[Q]
                 new.response[Q] := new.value
                 new.value := a[Q].function(new.value)
        end if
    end for
    return new
end make-prefer

Figure 8: The Make-Prefer Operation
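
The toggle-bit mechanism behind Lemmas 11 and 12 can be sketched in Python as follows. The classes Preference and Announcement, and the use of a plain list in place of the atomic scan of A, are assumptions made for this illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Announcement:
    toggle: bool                      # flipped each time the process announces
    function: Callable[[int], int]    # the read-modify-write function


@dataclass
class Preference:
    value: int                        # proposed object state
    horizon: List[bool]               # last toggle bit applied per process
    response: List[int]               # value seen by each process's invocation


def make_prefer(old, announced):
    """Apply every pending invocation to a copy of `old`.

    An invocation is pending when its toggle bit disagrees with the
    matching horizon bit in `old`.
    """
    new = Preference(old.value, list(old.horizon), list(old.response))
    for q, inv in enumerate(announced):
        if inv.toggle != old.horizon[q]:
            new.horizon[q] = not new.horizon[q]
            new.response[q] = new.value
            new.value = inv.function(new.value)
    return new


if __name__ == "__main__":
    start = Preference(10, [False, False], [0, 0])
    announced = [Announcement(True, lambda v: v + 1),
                 Announcement(False, lambda v: v * 2)]
    first = make_prefer(start, announced)
    second = make_prefer(first, announced)
    print(first.value, second.value)
```

Calling make_prefer a second time on its own result leaves the value unchanged, since the horizon bit now agrees with the announced toggle bit; this is the skip behavior that the proof of Lemma 11 relies on.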



compare&swap(old, new: value) returns(boolean)
    for i in 1..2 do
        last := latest(G)
        if last.value ≠ old
            then return false
        end if
        prefer := [value: new, horizon: last.horizon]
        toggle := ¬ last.horizon[P]
        prefer.horizon[P] := toggle
        decision := consensus(prefer)
        G[P] := [past: G[P].present,
                 winner: decision,
                 present: advance(G, P)]
        if latest(G).winner.horizon[P] = toggle
            then return true
        end if
    end for
    return false
end compare&swap

Figure 9: Compare&Swap Construction



